Although it is the English edition, the content is rich and thorough and well worth reading. Start with the full table of contents below; a note at the end explains how to get the book.
Part I. Fundamentals of Stream Processing with Apache Spark
1. Introducing Stream Processing
What Is Stream Processing?
Batch Versus Stream Processing
The Notion of Time in Stream Processing
The Factor of Uncertainty
Some Examples of Stream Processing
Scaling Up Data Processing
MapReduce
The Lesson Learned: Scalability and Fault Tolerance
Distributed Stream Processing
Stateful Stream Processing in a Distributed System
Introducing Apache Spark
The First Wave: Functional APIs
The Second Wave: SQL
A Unified Engine
Spark Components
Spark Streaming
Structured Streaming
Where Next?
2. Stream-Processing Model
Sources and Sinks
Immutable Streams Defined from One Another
Transformations and Aggregations
Window Aggregations
Tumbling Windows
Sliding Windows
Stateless and Stateful Processing
Stateful Streams
An Example: Local Stateful Computation in Scala
A Stateless Definition of the Fibonacci Sequence as a Stream Transformation
Stateless or Stateful Streaming
The Effect of Time
Computing on Timestamped Events
Timestamps as the Provider of the Notion of Time
Event Time Versus Processing Time
Computing with a Watermark
Summary
3. Streaming Architectures
Components of a Data Platform
Architectural Models
The Use of a Batch-Processing Component in a Streaming Application
Referential Streaming Architectures
The Lambda Architecture
The Kappa Architecture
Streaming Versus Batch Algorithms
Streaming Algorithms Are Sometimes Completely Different in Nature
Streaming Algorithms Can’t Be Guaranteed to Measure Well Against Batch Algorithms
Summary
4. Apache Spark as a Stream-Processing Engine
The Tale of Two APIs
Spark’s Memory Usage
Failure Recovery
Lazy Evaluation
Cache Hints
Understanding Latency
Throughput-Oriented Processing
Spark’s Polyglot API
Fast Implementation of Data Analysis
To Learn More About Spark
Summary
5. Spark’s Distributed Processing Model
Running Apache Spark with a Cluster Manager
Examples of Cluster Managers
Spark’s Own Cluster Manager
Understanding Resilience and Fault Tolerance in a Distributed System
Fault Recovery
Cluster Manager Support for Fault Tolerance
Data Delivery Semantics
Microbatching and One-Element-at-a-Time
Microbatching: An Application of Bulk-Synchronous Processing
One-Record-at-a-Time Processing
Microbatching Versus One-at-a-Time: The Trade-Offs
Bringing Microbatch and One-Record-at-a-Time Closer Together
Dynamic Batch Interval
Structured Streaming Processing Model
The Disappearance of the Batch Interval
6. Spark’s Resilience Model
Resilient Distributed Datasets in Spark
Spark Components
Spark’s Fault-Tolerance Guarantees
Task Failure Recovery
Stage Failure Recovery
Driver Failure Recovery
Summary
Part II. Structured Streaming
7. Introducing Structured Streaming
First Steps with Structured Streaming
Batch Analytics
Streaming Analytics
Connecting to a Stream
Preparing the Data in the Stream
Operations on Streaming Dataset
Creating a Query
Start the Stream Processing
Exploring the Data
Summary
8. The Structured Streaming Programming Model
Initializing Spark
Sources: Acquiring Streaming Data
Available Sources
Transforming Streaming Data
Streaming API Restrictions on the DataFrame API
Sinks: Output the Resulting Data
format
outputMode
queryName
option
options
trigger
start()
Summary
9. Structured Streaming in Action
Consuming a Streaming Source
Application Logic
Writing to a Streaming Sink
Summary
10. Structured Streaming Sources
Understanding Sources
Reliable Sources Must Be Replayable
Sources Must Provide a Schema
Available Sources
The File Source
Specifying a File Format
Common Options
Common Text Parsing Options (CSV, JSON)
JSON File Source Format
CSV File Source Format
Parquet File Source Format
Text File Source Format
The Kafka Source
Setting Up a Kafka Source
Selecting a Topic Subscription Method
Configuring Kafka Source Options
Kafka Consumer Options
The Socket Source
Configuration
Operations
The Rate Source
Options
11. Structured Streaming Sinks
Understanding Sinks
Available Sinks
Reliable Sinks
Sinks for Experimentation
The Sink API
Exploring Sinks in Detail
The File Sink
Using Triggers with the File Sink
Common Configuration Options Across All Supported File Formats
Common Time and Date Formatting (CSV, JSON)
The CSV Format of the File Sink
The JSON File Sink Format
The Parquet File Sink Format
The Text File Sink Format
The Kafka Sink
Understanding the Kafka Publish Model
Using the Kafka Sink
The Memory Sink
Output Modes
The Console Sink
Options
Output Modes
The Foreach Sink
The ForeachWriter Interface
TCP Writer Sink: A Practical ForeachWriter Example
The Moral of This Example
Troubleshooting ForeachWriter Serialization Issues
12. Event Time–Based Stream Processing
Understanding Event Time in Structured Streaming
Using Event Time
Processing Time
Watermarks
Time-Based Window Aggregations
Defining Time-Based Windows
Understanding How Intervals Are Computed
Using Composite Aggregation Keys
Tumbling and Sliding Windows
Record Deduplication
Summary
13. Advanced Stateful Operations
Example: Car Fleet Management
Understanding Group with State Operations
Internal State Flow
Using MapGroupsWithState
Using FlatMapGroupsWithState
Output Modes
Managing State Over Time
Summary
14. Monitoring Structured Streaming Applications
The Spark Metrics Subsystem
Structured Streaming Metrics
The StreamingQuery Instance
Getting Metrics with StreamingQueryProgress
The StreamingQueryListener Interface
Implementing a StreamingQueryListener
15. Experimental Areas: Continuous Processing and Machine Learning
Continuous Processing
Understanding Continuous Processing
Using Continuous Processing
Limitations
Machine Learning
Learning Versus Exploiting
Applying a Machine Learning Model to a Stream
Example: Estimating Room Occupancy by Using Ambient Sensors
Online Training
Part III. Spark Streaming
16. Introducing Spark Streaming
The DStream Abstraction
DStreams as a Programming Model
DStreams as an Execution Model
The Structure of a Spark Streaming Application
Creating the Spark Streaming Context
Defining a DStream
Defining Output Operations
Starting the Spark Streaming Context
Stopping the Streaming Process
Summary
17. The Spark Streaming Programming Model
RDDs as the Underlying Abstraction for DStreams
Understanding DStream Transformations
Element-Centric DStream Transformations
RDD-Centric DStream Transformations
Counting
Structure-Changing Transformations
Summary
18. The Spark Streaming Execution Model
The Bulk-Synchronous Architecture
The Receiver Model
The Receiver API
How Receivers Work
The Receiver’s Data Flow
The Internal Data Resilience
Receiver Parallelism
Balancing Resources: Receivers Versus Processing Cores
Achieving Zero Data Loss with the Write-Ahead Log
The Receiverless or Direct Model
Summary
19. Spark Streaming Sources
Types of Sources
Basic Sources
Receiver-Based Sources
Direct Sources
Commonly Used Sources
The File Source
How It Works
The Queue Source
How It Works
Using a Queue Source for Unit Testing
A Simpler Alternative to the Queue Source: The ConstantInputDStream
The Socket Source
How It Works
The Kafka Source
Using the Kafka Source
How It Works
Where to Find More Sources
20. Spark Streaming Sinks
Output Operations
Built-In Output Operations
saveAsxyz
foreachRDD
Using foreachRDD as a Programmable Sink
Third-Party Output Operations
21. Time-Based Stream Processing
Window Aggregations
Tumbling Windows
Window Length Versus Batch Interval
Sliding Windows
Sliding Windows Versus Batch Interval
Sliding Windows Versus Tumbling Windows
Using Windows Versus Longer Batch Intervals
Window Reductions
reduceByWindow
reduceByKeyAndWindow
countByWindow
countByValueAndWindow
Invertible Window Aggregations
Slicing Streams
Summary
22. Arbitrary Stateful Streaming Computation
Statefulness at the Scale of a Stream
updateStateByKey
Limitation of updateStateByKey
Performance
Memory Usage
Introducing Stateful Computation with mapWithState
Using mapWithState
Event-Time Stream Computation Using mapWithState
23. Working with Spark SQL
Spark SQL
Accessing Spark SQL Functions from Spark Streaming
Example: Writing Streaming Data to Parquet
Dealing with Data at Rest
Using Join to Enrich the Input Stream
Join Optimizations
Updating Reference Datasets in a Streaming Application
Enhancing Our Example with a Reference Dataset
Summary
24. Checkpointing
Understanding the Use of Checkpoints
Checkpointing DStreams
Recovery from a Checkpoint
Limitations
The Cost of Checkpointing
Checkpoint Tuning
25. Monitoring Spark Streaming
The Streaming UI
Understanding Job Performance Using the Streaming UI
Input Rate Chart
Scheduling Delay Chart
Processing Time Chart
Total Delay Chart
Batch Details
The Monitoring REST API
Using the Monitoring REST API
Information Exposed by the Monitoring REST API
The Metrics Subsystem
The Internal Event Bus
Interacting with the Event Bus
Summary
26. Performance Tuning
The Performance Balance of Spark Streaming
The Relationship Between Batch Interval and Processing Delay
The Last Moments of a Failing Job
Going Deeper: Scheduling Delay and Processing Delay
Checkpoint Influence in Processing Time
External Factors that Influence the Job’s Performance
How to Improve Performance?
Tweaking the Batch Interval
Limiting the Data Ingress with Fixed-Rate Throttling
Backpressure
Dynamic Throttling
Tuning the Backpressure PID
Custom Rate Estimator
A Note on Alternative Dynamic Handling Strategies
Caching
Speculative Execution
Part IV. Advanced Spark Streaming Techniques
27. Streaming Approximation and Sampling Algorithms
Exactness, Real Time, and Big Data
Exactness
Real-Time Processing
Big Data
The Exactness, Real-Time, and Big Data Triangle
Big Data and Real Time
Approximation Algorithms
Hashing and Sketching: An Introduction
Counting Distinct Elements: HyperLogLog
Role-Playing Exercise: If We Were a System Administrator
Practical HyperLogLog in Spark
Counting Element Frequency: Count Min Sketches
Introducing Bloom Filters
Bloom Filters with Spark
Computing Frequencies with a Count-Min Sketch
Ranks and Quantiles: T-Digest
T-Digest in Spark
Reducing the Number of Elements: Sampling
Random Sampling
Stratified Sampling
28. Real-Time Machine Learning
Streaming Classification with Naive Bayes
streamDM Introduction
Naive Bayes in Practice
Training a Movie Review Classifier
Introducing Decision Trees
Hoeffding Trees
Hoeffding Trees in Spark, in Practice
Streaming Clustering with Online K-Means
K-Means Clustering
Online Data and K-Means
The Problem of Decaying Clusters
Streaming K-Means with Spark Streaming
Part V. Beyond Apache Spark
29. Other Distributed Real-Time Stream Processing Systems
Apache Storm
Processing Model
The Storm Topology
The Storm Cluster
Compared to Spark
Apache Flink
A Streaming-First Framework
Compared to Spark
Kafka Streams
Kafka Streams Programming Model
Compared to Spark
In the Cloud
Amazon Kinesis on AWS
Microsoft Azure Stream Analytics
Apache Beam/Google Cloud Dataflow
30. Looking Ahead
Stay Plugged In
Seek Help on Stack Overflow
Start Discussions on the Mailing Lists
Attend Conferences
Attend Meetups
Read Books
Contributing to the Apache Spark Project
Five PDFs are available; reply "ss" in the official account's backend to get them.
Previous recommendations:
Spark Core: An Analysis of Shuffle
When Data Models Can't Be Reused, the Root Cause Is Design
Data Warehouses, Data Lakes, and Unified Batch/Stream Processing, Finally Explained Clearly!
How to Manage Sprawling Data Metrics in a Unified Way?
Notes on the 20-Lecture Practical Project Management Course (NetEase, Lei Beibei)
Key Goals and Technical Implementation of a Metadata Center
Hive Coding Conventions That Help with Tuning
Exploring HBase Internals: The Data Model
Exploring HBase Internals: How HBase Stores Data
Exploring HBase Internals: The Journey of a KeyValue
How Exactly Do You Build a Data Middle Platform?
What Kind of Company Actually Needs a Data Middle Platform?
Is the Data Middle Platform Really the Next Stop for Big Data?